Improved detection of SBDS gene mutation by a new method of next-generation sequencing analysis based on the Chinese mutation spectrum

Next-generation sequencing (NGS) is a useful molecular diagnostic tool for genetic diseases. However, due to the presence of highly homologous pseudogenes, it is challenging to use short-read NGS for analyzing mutations of the Shwachman-Bodian-Diamond syndrome (SBDS) gene. The SBDS mutation spectrum was analyzed in the Chinese population, which revealed that SBDS variants were primarily from sequence exchange between SBDS and its pseudogene at the base-pair level, predominantly in the coding region and splice junction of exon two. The c.258+2T>C and c.185_184TA>GT variants were the two most common pathogenic SBDS variants in the Chinese population, resulting in a total carrier frequency of 1.19%. When analyzing pathogenic variants in the SBDS gene from the NGS data, the misalignment was identified as a common issue, and there were different probabilities of misalignment for different pathogenic variants. Here, we present a novel mathematical method for identifying pathogenic variants in the SBDS gene from the NGS data, which utilizes read-depth of the paralogous sequence variant (PSV) loci of SBDS and its pseudogene. Combined with PCR and STR orthogonal experiments, SBDS gene mutation analysis results were improved in 40% of clinical samples, and various types of mutations such as homozygous, compound heterozygous, and uniparental diploid were explored. The findings effectively reduce the impact of misalignment in NGS-based SBDS mutation analysis and are helpful for the clinical diagnosis of SBDS-related diseases, the research into population variation, and the carrier screening.

types of mutations such as homozygous, compound heterozygous, and uniparental diploid were explored. The findings effectively reduce the impact of misalignment in NGS-based SBDS mutation analysis and are helpful for the clinical diagnosis of SBDS-related diseases, the research into population variation, and the carrier screening.

Introduction
High-throughput, high-sensitivity, and low-cost next-generation sequencing (NGS) have made tremendous progress in the field of clinical molecular testing. Analysis with NGS accurately obtains information on various types of mutation by repeatedly reading the sequence of each base and searching for the optimum matching coordinates in the reference genome [1] . However, when a gene region has one or more pseudogenes with high sequence homology, the short NGS reads cannot be aligned accurately to the target locus, which, in turn, increases the likelihood of false positives or false negatives [2] . Misalignment is more likely to occur, especially when there is a conversion between gene and its pseudogene, presented by the exchange of one or multiple paralogous sequence variants (PSVs) [3] . Therefore, the interference caused by pseudogenes as well as their conversion is one of the key problems with the clinical application of NGS.
The protein encoded by the Shwachman-Bodian-Diamond syndrome (SBDS) gene is involved in ribosomal RNA processing. Since its first publication in 2003, pathogenic variants in the SBDS gene have been detected in more than 90% of people with Shwachman-Diamond syndrome (SDS) [4] . Also known as Shwachman-Bodian-Diamond syndrome, SDS is an autosomal recessive genetic disease involving multiple systems. Its clinical manifestations include exocrine pancreatic dysfunction (chronic diarrhea), skeletal metaphysis development disorder, and different degrees of bone marrow dysfunction accompanied by hypocytosis [5,6] . About one-third of patients suffer from myelodysplastic syndrome and acute myeloid leukemia [7] . Other clinical symptoms include short stature, liver dysfunction, susceptibility to infection, nephrocalcinosis, myocardial necrosis, and neonatal respiratory distress [8,9] . Dozens of SBDS pathogenic variants have been reported to cause SDS, the most common of which include c.185_184TA>GT, c.258+2T>C, and c.[185_184TA>GT;258+2T>C]. A study used targeted analysis for the three common SBDS pathogenic variants and detected at least one pathogenic variant in 90% of affected individuals and both pathogenic variants in approximately 62% of affected individuals who had SDS along with SBDS [4] . Nonetheless, multi-gene testing using NGS is still recommended in the molecular diagnosis of suspected cases of SDS that can be caused by pathogenic variants in other genes besides SBDS. Additionally, the phenotypic spectrum of SDS is broad, and multi-gene testing might be the best option for an individual with atypical phenotypic features [10] .
The SBDS gene has a highly homologous pseudogene, SBDSP1, which is located~5.8Mb downstream of the SBDS gene at chromosome 7q11. The five exon coding sequences of SBDS have 96.8% (95%-98.2%) homology with SBDSP1. Since the common SBDS pathogenic variants c.185_184TA>GT and c.258+2T>C are at the functional PSV loci, it has been suggested that recombination and gene conversion might occur between SBDS and SBDSP1 [11] . Based on the above factors, researchers have recognized the complexity and challenges of NGS analysis of the SBDS gene, and therefore suggest using PCR, RT-PCR or specific bioinformatics tools to increase the sensitivity and specificity of detecting pathogenic variants in SBDS [12][13][14] .
This study focused on analyzing the SBDS mutation spectrum through large-scale whole-exome sequencing (WES) within the Chinese population. After determining the misalignment issues in the NGS data analysis, we tried to overcome these problems by developing a novel mathematical method using NGS data to analyze the pathogenic variants of the SBDS gene.

DNA Samples
To study the SBDS mutation spectrum of the Chinese population and the misalignment of NGS data, we selected 20,542 de-identified samples that were sent to Fulgent (Fujian) Technologies Co., Ltd.(Short for "Fujian Fulgent") for WES by NGS from October, 2019 to March, 2021. Authors could not access to information that could identify individual participants during or after data collection after application. All research participants were from the Chinese population, and their EDTA anti-coagulated venous whole blood was obtained after signing their informed consent.
Samples were divided into non-SDS and suspected SDS groups according to whether the individual's clinical diagnosis included SDS phenotypes, such as chronic diarrhea, cytopenia, or liver dysfunction. The non-SDS group included a total of 20,395 cases, in which none of the above phenotypes were detected. The suspected SDS group contained any two of the above phenotypes and included a total of 147 cases. Since most of the samples sent to Fujian Fulgent were from suspected cases of genetic diseases, the grouping criteria did not include common phenotypes of genetic diseases, such as growth failure and bone abnormalities, but included liver dysfunction, which is reportedly more common in people from China with SDS. According to the manufacturer's instructions, the genomic DNA was extracted from whole blood using commercially available DNA isolation kits. This study received approval from the ethics committee of No. 900 Hospital of the Joint Logistics Team.
Samples from the same capture pool were grouped together. The raw sequence data were aligned to the human reference genome GRCh37/hg19 using Novoalign v3.09.00 software (Novocraft Technologies, Malaysia). SNPs and short indels were called using SAMtools v1.3.1 and VarScan software v2.3.9. Variants were annotated using Alamut-Batch standalone v1.9 software (Interactive Biosoftware, France). The average read-depth of each exon of SBDS and the corresponding region of SBDSP1, as well as the read-depth of their PSVs loci, were extracted using bedtools coverage (v.2.30.0). Variants were curated manually according to the American College of Medical Genetics (ACMG) scoring system [15] .

Sanger sequencing
The primer sequences provided by Woloszynek were employed to amplify the five exons, splice sites, and intron sequences flanking the exons of SBDS [16] . Amplicons were generated using 5 ng of genomic DNA using Ex Taq polymerase (Takara, China). The PCR conditions for this amplification were as follows: 95°C for 5 min, 35× (95°C for 30 s and 60°C for 15 min), 72°C for 5 min. Products were verified by agarose gel electrophoresis then sequenced using the sequencing primers on a 3730 genetic analyzer (Applied Biosystems, USA).

STR typing
In the parentage test, samples DNA were detected with Microreader™ 21 ID System (Suzhou Microread Genetics, Suzhou, China), according to the standard procedures of the manufacturer.

Results and Discussions
Previous studies have speculated that SBDS pathogenic variants may be converted from SBDSP1 because these pathogenic variants often occur at the functional PSV loci of SBDS and sometimes include other cis variants at PSV loci [11] . Therefore, research has sought to determine if the formation of SBDS pathogenic variants occurs through gene conversion and how frequently this conversion takes place. This information is very important for formulating NGS data analysis strategies for SBDS genes and validating the accuracy of the results [2] . This study aimed to acquire this information through large-scale WES NGS data analysis. In general, the average read-depth ratio of a gene to its pseudogene will significantly shift in genes or regions where gene conversion is more frequent, such as exons 13-15 of the PMS2 gene [17] . The average read-depths of each exon of SBDS and the corresponding region of SBDSP1 were extracted to estimate the frequency of conversion between SBDS and SBDSP1. Subsequently, the read-depth ratios of SBDS to SBDSP1 were calculated of 2000 WES samples, which were randomly selected from the Chinese population in the non-SDS group. Theoretically, when SBDS is converted to SBDSP1, the read-depth of SBDS becomes lower than that of the corresponding SBDSP1 region, thereby leading to a decrease in the ratio of SBDS to SBDSP1. When the opposite conversion occurs, the ratio increases. In this study, the average read depth of each exon of SBDS was not significantly different from that of the corresponding region of SBDSP1 (Figure 1), indicating that the SBDS NGS data can usually be accurately aligned, while gene conversion of SBDS and SBDSP1 in exon-level or in a larger range is not common.
Further analysis aimed to determine if the interlocus conversion between the SBDS and SBDSP1 genes occurred at the sub-exon level [18,19] . Since this conversion was manifested as the exchange of PSVs in NGS analysis, we tried to obtain the SBDS mutation spectrum from WES data of 20,395 samples from the Chinese population. A total of 64 different SBDS variants were discovered, of which seven were detected more than ten times (Supplement Table 1). Among them, c.141C>T, c.258+2T>C, and c.201A>G were the top three most frequently detected variants, with carrier frequencies of 4.16%, 1.01%, and 0.61%, respectively, and all were at PSV loci of SBDS.
The total carrier frequency of ten variants at PSV loci of SBDS was 6.18%, which was significantly higher than other fifty-four at non-PSV position (1.23%). In addition, through analyzing the NGS data of SBDSP1, the frequency of these common variants in SBDS were found to vary in accordance with variants in the corresponding SBDSP1 loci (P < 0.05), but did not results from the reciprocal sequence exchanges because there was no significant correlation between their occurrence (P>0.05) (Supplement Table 2). This observation indicated that the variants of SBDS mainly originated from the gene-conversion event, the frequency of which was very low in the Chinese population. Furthermore, we observed that most variants at PSV loci of SBDS were detected in isolation, and only 35 samples were found to have in-cis variants at adjacent PSV loci, all of which were in exon two and its flanking sequence of the SBDS gene.
According to the distance between PSVs in the SBDS coding region, it can be inferred that sequence conversions between SBDS and SBDSP1 mainly occurred at the base pairs level, and longer conversions at the sub-exon level only occur in the Exon two region. Four variants were found that could be classified as pathogenic or likely pathogenic. The c.258+2T>C variant was detected most frequently (207 cases), followed by c.185_184TA>GT (33 cases), while the other two were null mutations which were only detected once (Table 1 and Supplement Table 1). The total carrier frequency of SBDS pathogenic variants in the Chinese population was 1.19%. According to Hardy-Weinberg equilibrium, the theoretical incidence of SBDS-related SDS in the Chinese population was 3.52 per 100,000, which was higher than that of other populations (0.5-1.5/100,000) [20] . This is the first study to analyze the carrier frequency of SBDS pathogenic variants in the Chinese population with large sample size. SDS is a severe early-onset disease, and the overall carrier frequency of SBDS gene pathogenic variants in the Chinese population is >1% [21] . Therefore, this study provided evidence for the SBDS gene as a candidate for screening potential carriers. Regarding the spectrum of pathogenic variants in the population, analyzing the common hotspot mutations, such as c.258+2T>C and c.185_184TA>GT, could effectively screen out most of the carriers.
Since PSVs could be informative for NGS read alignment for highly homologous genes, the more variants at SBDS PSVs loci occurred in a given NGS read, the more likely it was to be misaligned to SBDSP1, and the higher was the risk of false negatives. In the NGS data with 207 c.258+2T>C and 33 c.185_184TA>GT, it was found that most c.258+2T>C were detected in isolation (91.3%), and only a few detected in-cis with a variant at the c.201A PSV (8.7%). On the other hand, c.185_184TA>GT had a higher proportion (51.5%) detected in-cis with variants at the c.141C and/or c.201A PSVs, suggesting that NGS data reads containing c.185_184TA>GT have a greater chance of misalignment (table). In our study, the allele balance of c.258+2T>C and c.185_184TA>GT both had a significant shift. Especially, nearly half of c.185_184TA>GT variants had allele balance < 0.3, the majority of which were arranged in-cis with other variants at PSV loci (Figure 2). Allele balance bias made it difficult to distinguish whether the mutation reads were chimeric or misaligned. Furthermore, allele balance bias is an important quality control filter of NGS data. A pathogenic variant might be filtered if the allele balance is below the limit (usually 10%). Consequently, misdiagnosis and missed diagnosis were likely to occur, especially for individuals with atypical SDS symptoms. These two pathogenic variants were even excluded from The Exome Aggregation Consortium (ExAC) database due to failure to pass the variant quality score recalibration (VQSR) filter. It is worth noting that the common c.[185_184TA>GT; 258+2T>C] variant was not found in the Chinese population [22] , which was inconsistent with the theoretically predicted carrier frequency. It was speculated that the SBDS NGS reads containing this variant were completely aligned to the SBDSP1 locus due to the exchange of multiple PSVs, suggesting that the NGS analysis on c.[185_184TA>GT;258+ 2T>C] had limitations. Hence, it is necessary to establish a more effective analysis process to resolve the interference caused by the misalignment of NGS data in the analysis of SBDS variation.
In view of the above analysis of the SBDS variant and pathogenic variant spectrum in the Chinese population, we have found that the pathogenic variants of SBDS were mainly at functional PSV loci caused by short sequence exchange in the coding region and splice junction of exon two. Therefore, we formulated an NGS data processing procedure for SBDS in this particular region to overcome the NGS misalignment problem. The primary strategy comprised the generation of a mathematical relationship between the SBDS:SBDSP1 read-depth ratio (RR) of each PSV locus and the total SBDS+SBDSP1 read-depth (TR) of each PSV locus. As shown in Figure 3, with wild-type SBDS and SBDSP1, NGS reads could be accurately aligned to the correct locus, and the RR values of all PSV loci were about 1 (between 0.8-1.2). When one or multiple PSVs of SBDSP1 appeared in SBDS reads, a portion of SBDS reads was misaligned to SBDSP1, and the RR values of PSV loci were reduced (< 1). At this time, variants with a low allele balance could be called at PSV loci of the SBDS gene, and variants with a low allele balance adjacent to the corresponding PSV loci of SBDSP1 could also be called. The TR value of PSVs loci was not significantly different from that of other WES data in the same capture pool. When there were more adjacent PSVs of SBDSP1 in SBDS reads, more SBDS reads were misaligned to SBDSP1, and the RR values of multiple PSV loci decreased (as low as 1/3). We could not identify mutations in SBDS or SBDSP1, but the TR value of PSVs loci was not significantly different from that of other WES data in the same capture pool. In addition, when the individual carrying a heterozygous deletion of the SBDS gene, the RR values of multiple PSV loci were also reduced (~1/2), but the TR value was lower than that of other WES data in the same capture pool, which was different from the misalignment. The novel strategy had the potential to find misalignment events in SBDS NGS analyses and provides an additional opportunity to manually check NGS data and apply orthogonal methods.  Due to the misalignment of NGS data, SBDS variants sometimes have abnormal allele balances, which would then manifest as unexpected segregation patterns or be misunderstood as somatic mosaicism. However, abnormal genetic events like these are really rare in reported cases. This was also supported by the fact that all SBDS variants from the suspected SDS group were inherited from heterozygous parents, proved by orthogonal methods. In particular, we found that one case (case 3) had a homozygous c.258+2T>C variant only inherited from a heterozygous father. The WES data showed normal RR and TR values at the SBDS PSV loci, which ruled out the possibility of large fragment DNA deletion or reads misalignment in the SBDS gene ( Figure 8 and 9). Further, a multiplex STR testing was done to explain this unexpected pattern of segregation, and the results showed that the homozygous c.258+2T>C variant was caused by the paternal uniparental disomy of chromosome seven ( Figure 10). This is the first report of uniparental disomy caused a homozygous pathogenic variant in SDBS. It is suggested that our strategy for SBDS NGS analyses provided an additional opportunity to apply more appropriate orthogonal methods.

Conclusion
Our proposed NGS variant data analysis method, based on read-depth analysis of the PSV loci of SBDS and its pseudogene, could effectively overcome the misalignment problem in SBDS NGS data analysis and improve detection rates of SBDS pathogenic variants. Based on the Chinese mutation spectrum of SBDS, the novel method not only focused on the most common variants in the SBDS gene exon 2, but also could be used for the detection of structural variation (SV) in SBDS. Furthermore, the method was useful for performing more efficient molecular diagnosis, the variant spectrum study, and large-scale carrier screening of SDS.  The median read-depth ratio of SBDS to SBDSP1 among 2000 random samples in non-SDS group.
The error bars represent the standard error of the median read-depth ratio.